Abstract —Polar codes are of great interest because they
provably achieve the capacity of both discrete and continuous 
memoryless channels while having an explicit construction. Most 
existing decoding algorithms of polar codes are based on bit-wise 
hard or soft decisions. In this paper, we propose symbol-decision 
successive cancellation (SC) and successive cancellation list (SCL) 
decoders for polar codes, which use symbol-wise hard or soft 
decisions for higher throughput or better error performance. 
First, we propose to use a recursive channel combination to 
calculate symbol-wise channel transition probabilities, which lead 
to symbol decisions. Our proposed recursive channel combination 
also has a lower complexity than simply combining bit-wise 
channel transition probabilities. The similarity between our 
proposed method and Arikan’s channel transformations also 
helps to share hardware resources between calculating bit- and 
symbol-wise channel transition probabilities. Second, a two-stage 
list pruning network is proposed to provide a trade-off between 
the error performance and the complexity of the symbol-decision 
SCL decoder. Third, since memory is a significant part of SCL 
decoders, we propose a pre-computation memory-saving technique
to reduce the memory requirement of an SCL decoder. Finally,
to evaluate the throughput advantage of our symbol-decision 
decoders, we design an architecture based on a semi-parallel 
successive cancellation list decoder. In this architecture, different 
symbol sizes, sorting implementations, and message scheduling 
schemes are considered. Our synthesis results show that, in terms
of area efficiency, our symbol-decision SCL decoders outperform
existing bit- and symbol-decision SCL decoders.

Index Terms —Error control codes, polar codes, successive 
cancellation, list decoding algorithm, hardware implementation 

I. Introduction 

Polar codes, a groundbreaking finding by Arikan [1] in
2009, have ignited a spark of research interest in the fields of
communication and coding theory, because they can provably
achieve the capacity for both discrete [1] and continuous [2]
memoryless channels. The second reason why polar codes are
attractive is their low encoding and decoding complexity. For
example, a polar code of length $N$ can be decoded by the
successive cancellation (SC) algorithm [1] with a complexity
of $O(N \log N)$. However, capacity is approached
only when the code length is large enough ($N > 2^{20}$
[3]) if the SC algorithm is used. For short or moderate code
lengths, the error performance of polar codes with
the SC algorithm is inferior to that of Turbo codes or low-density
parity-check (LDPC) codes.

Since the debut of polar codes, a lot of efforts have been 
made to improve the error performance of short polar codes. 
Systematic polar codes [6] were proposed to reduce the bit
error rate (BER) while guaranteeing the same frame error 
rate (FER) as their non-systematic counterparts. Although a 
Viterbi algorithm [7], a sphere decoding algorithm [8], and a
stack sphere decoding algorithm [9] can provide maximum
likelihood (ML) decoding of polar codes, they are considered 
infeasible, especially for long polar codes, due to their much 
higher complexity than the SC algorithm. Recently, an SC
list (SCL) algorithm for polar codes was proposed in [10] to bridge
the performance gap between the SC algorithm and ML
algorithms at the cost of a complexity of $O(LN \log N)$, where
L is the list size. Moreover, the concatenation of polar codes 
with cyclic redundancy check (CRC) codes was introduced in 
0.GD- To decode the CRC-concatenated polar codes, a CRC 
detector is used in the SCL algorithm to help select the output 
codeword. The combination of an SCL algorithm and a CRC 
detector is called CRC-aided SCL (CA-SCL) algorithm. £0) 
shows that with the CA-SCL algorithm, the error performance 
of a (2048, 1024) CRC-concatenated polar code is better than that
of a (2304, 1152) LDPC code, which is used in the WiMax 
standard 01 - 

Several architectures have been proposed for the SC algorithm.
Arikan [1] showed that a fully parallel SC decoder has
a latency of $2N - 1$ clock cycles. A tree SC decoder and a
line SC decoder with complexity of $O(N)$ were proposed in
©■ These two decoders have the same latency as the fully 
parallel SC decoder. To reduce complexity further, Leroux 
et al. 0 proposed a semi-parallel SC decoder for polar codes
by taking advantage of the recursive structure of polar codes
to reuse processing resources. Assuming that the number of
processing elements (PEs) is $P$ ($P = 2^p < N$), the latency
of the semi-parallel SC decoder is $2N + \frac{N}{P}\log_2\left(\frac{N}{4P}\right)$ clock
cycles. To reduce the latency, a simplified SC (SSC) polar
decoder was introduced in © and it was further analyzed 
in 00 - In the SSC polar decoder, a polar code is converted 
to a binary tree including three types of nodes: rate-one,
rate-zero and rate-$R$ nodes. Based on the SSC polar decoder, the
ML-SSC decoder makes use of the ML algorithm to deal
with part of the rate-$R$ nodes in ©. However, the SSC and
ML-SSC polar decoders depend on positions of information bits
and frozen bits, and are consequently code-specific. In [17], a
pre-computation look-ahead technique was proposed to reduce
the latency of the tree SC decoder by half. For the SCL polar
decoder, the semi-parallel architecture was adopted in [18]. In
[19], Balatsoukas-Stimming et al. proposed an architecture with
$L = 4$ that achieves a throughput of 124 Mbps and a latency of
8.25 µs when decoding a (1024, 512) polar code. In [20], Lin
and Yan designed an SCL polar decoder with a throughput
of 182 Mbps and a latency of 5.63 µs. To reduce the memory
requirement, log-likelihood ratio (LLR) messages are used
in [21]. The throughput of existing polar decoders is still not
high enough for high-speed applications.
Since the low throughput (or long latency) of the SC
decoder is due to its serial nature, several previous works
attempt to improve the throughput (or reduce the latency). In [22], the
data bits of a polar code are split into several streams, which
are decoded simultaneously. This idea of parallel processing
is extended in [23], where the SC decoder is transformed into
a concatenated decoder whose inner SC decoders are
carried out in parallel. Yuan and Parhi proposed a multi-bit
SCL decoder [24].

In this paper, we address the throughput/latency issue by 
proposing symbol-decision SC and SCL decoders, which are 
based on symbol-wise hard or soft decisions. Since each 
symbol consists of M bits, when M > 1 the symbol- 
decision decoders achieve higher throughput as well as better 
error performance. The proposed symbol-decision decoders
are natural generalizations of their bit-wise counterparts, and
reduce to existing bit-wise decoders when the symbol size is
one bit. The main contributions of this paper are:

• We propose a novel recursive channel combination to
calculate the symbol-wise channel transition probabilities,
which enable symbol decisions in SC and SCL
algorithms. The proposed recursive channel combination
also has a lower complexity than simply combining
bit-wise channel transition probabilities. The similarity
between Arikan's recursive channel transformation
and our symbol-wise recursive channel combination helps
to share hardware resources to calculate the bit- and
symbol-based channel transition probabilities.

• An $M$-bit symbol-decision SCL decoder needs to find the
$L$ most reliable candidates out of $2^M L$ list candidates. We
propose a two-stage list pruning network to perform this
sorting function. This pruning network also provides a
trade-off between performance and complexity.

• By adopting the pre-computation technique [25], we develop
a pre-computation memory-saving (PCMS) technique to
reduce the memory requirement of the SCL decoder.
Specifically, the channel information memory can be
eliminated when using the PCMS technique. Moreover,
this technique also helps to improve throughput slightly.

• To evaluate the throughput of symbol-decision SC decoders,
we propose an area-efficient architecture for
symbol-decision SCL decoders.* In our architecture, to
save area, adders in processing units are reused to
calculate the symbol-wise channel transition probability.
We propose two scheduling schemes for sharing hardware
resources. We also propose two list pruning networks for
designs with different symbol sizes.

• We design two-, four-, and eight-bit symbol-decision SCL
decoders for a (1024, 480) CRC32-concatenated polar
code with a list size of four. Synthesis results show
that in terms of area efficiency, our symbol-decision
SCL decoder outperforms all existing state-of-the-art
SCL decoders in [19]-[21], [24]. For example, the area
efficiency of our four-bit symbol-decision SCL decoder
is 259.2 Mb/s/mm$^2$, which is 1.51 times that
of [21]. Our implementation results also demonstrate that
the symbol-decision SCL decoder can provide a range of
tradeoffs between area, throughput, and area efficiency.

*We focus on the SCL decoder because the SC decoder can be considered
as an SCL decoder with a list size of one.



Our symbol-decision decoding algorithms assume that the
underlying channel has a binary input, and our symbol-wise
channel transformation is virtual and introduced for decoding
only. Hence, our work is different from those assuming a $q$-ary
($q > 2$) channel (see, for example, [26]).

The decoding schedule (bit sequence) of our symbol-decision
decoding algorithms is actually the same as those
in [22]-[24], but our symbol-decision decoding algorithms
are different from those in [22]-[24] in two aspects. First,
our symbol-wise recursive channel transition is different from
how transition probabilities are derived in [22]-[24]. Second,
the symbol-decision perspective allows us to prove that
the symbol-decision algorithms have better frame error rates
(FERs) than their bit-decision counterparts [27], while only
simulation results are provided in [22], [24] and error
performance is not investigated in [23]. There are additional
differences between our decoding algorithms/architectures and
those in [22]-[24]. For instance, all the bits within a symbol
are estimated jointly in our symbol-decision SC algorithm,
whereas some bits are decoded independently for the decoder
with parallelism two in [22]. Also, while our symbol-decision
decoding is introduced on the algorithmic level, the multibit
decoder is introduced on the level of decoding operations [24].
Finally, for our symbol-decision SCL decoders, we use the 
semi-parallel architecture because it is more area efficient than 
the tree architecture and the line architecture GD- 

The rest of our paper is organized as follows. Section II
briefly reviews polar codes and existing decoding algorithms
for polar codes. In Section III, the symbol-based recursive
channel combination is proposed to calculate the symbol-based
channel transition probability. Moreover, to simplify
the selection of the list candidates, a two-stage list pruning
network is proposed. In Section IV, we introduce a method to
reduce the memory requirement of list decoders of polar codes by
the pre-computation technique. In Section V, we demonstrate the
hardware architecture for symbol-decision SCL decoders. Two
scheduling schemes for hardware sharing are discussed. We
also propose two list pruning networks for different designs:
a folded sorting implementation and a tree sorting implementation.
A discussion on the latency of our architecture
and synthesis results for our implementations are provided in
this section as well. Finally, we draw some conclusions in
Section VI.


II. Polar Codes and Existing Decoding 
Algorithms 

A. Preliminaries 

We follow the notation for vectors in [1], namely $u_a^b =
(u_a, u_{a+1}, \ldots, u_{b-1}, u_b)$; if $a > b$, $u_a^b$ is regarded as void.
$u_{a,o}^b$ and $u_{a,e}^b$ denote the subvectors of $u_a^b$ with odd and even
indices, respectively.

Let $W : \mathcal{X} \to \mathcal{Y}$ represent a generic B-DMC with
binary input alphabet $\mathcal{X}$, arbitrary output alphabet $\mathcal{Y}$, and
transition probabilities $W(y|x)$, $y \in \mathcal{Y}$, $x \in \{0,1\}$. Assume
$N$ is an arbitrary integer and $M$ is an integer satisfying
$M | N$. Let $W_{N,M}^{(j)}$'s denote a set of coordinate channels:
$W_{N,M}^{(j)} : \mathcal{X}^M \to \mathcal{Y}^N \times \mathcal{X}^{(j-1)M}$, $0 < j \le N/M$, with
the transition probabilities $W_{N,M}^{(j)}(y_1^N, x_1^{(j-1)M} | x_{(j-1)M+1}^{jM})$,
where $(y_1^N, x_1^{(j-1)M})$ and $x_{(j-1)M+1}^{jM}$ denote the output and
input of $W_{N,M}^{(j)}$, respectively.

B. Polar Codes 

Polar codes are linear block codes, and their block lengths
are restricted to powers of two, denoted by $N = 2^n$ for $n > 2$.
Assume $u = u_1^N = (u_1, u_2, \ldots, u_N)$ is the data bit sequence.
Let $F = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}$. The corresponding encoded bit sequence $x =
x_1^N = (x_1, x_2, \ldots, x_N)$ is generated by

$$x = u B_N F^{\otimes n}, \tag{1}$$

where $B_N$ is the $N \times N$ bit-reversal permutation matrix and
$F^{\otimes n}$ denotes the $n$-th Kronecker power of $F$ [1].
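
As an illustration of Eq. (1), the following minimal sketch encodes a short block by forming the Kronecker power and the bit-reversal permutation explicitly. The function names (`kron`, `bit_reverse`, `polar_encode`) are hypothetical and chosen for this example only; the dense matrix product is meant for clarity on small $N$, not as an efficient encoder.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]


def bit_reverse(index, n):
    """Reverse the n-bit binary representation of a 0-based index."""
    result = 0
    for _ in range(n):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def polar_encode(u):
    """Return x = u * B_N * (n-th Kronecker power of F) over GF(2)."""
    N = len(u)
    n = N.bit_length() - 1
    if N < 1 or 1 << n != N:
        raise ValueError("block length must be a power of two")
    F = [[1, 0], [1, 1]]
    G = [[1]]
    for _ in range(n):
        G = kron(F, G)  # after the loop, G is the n-th Kronecker power of F
    # v = u * B_N: position j takes the bit whose index is the bit-reversal of j
    v = [u[bit_reverse(j, n)] for j in range(N)]
    # x = v * G with arithmetic modulo 2
    return [sum(v[r] * G[r][c] for r in range(N)) % 2 for c in range(N)]
```

For $N = 4$ this reproduces $x = (u_1 \oplus u_2 \oplus u_3 \oplus u_4,\ u_3 \oplus u_4,\ u_2 \oplus u_4,\ u_4)$.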

For any index set $\mathcal{A} \subseteq \{1, 2, \ldots, N\}$, $u_{\mathcal{A}} = (u_i : 0 < i \le
N, i \in \mathcal{A})$ is the sub-sequence of $u$ restricted to $\mathcal{A}$. For an
$(N, K)$ polar code, the data bit sequence is grouped into two
parts: a $K$-element part $u_{\mathcal{A}}$, which carries information bits, and
$u_{\mathcal{A}^c}$, whose elements are predefined frozen bits, where $\mathcal{A}^c$ is
the complement of $\mathcal{A}$. For convenience, frozen bits are set to
zero.


C. SC Algorithm for Polar Codes 

Given a transmitted codeword $x$ and the corresponding
received word $y$, the SC algorithm for an $(N, K)$ polar code
estimates the encoding bit sequence $u$ successively as shown
in Alg. 1. Here, $\hat{u} = (\hat{u}_1, \hat{u}_2, \ldots, \hat{u}_N)$ represents the estimated
value for $u$.


Algorithm 1: SC Decoding Algorithm [1]

for $j = 1 : N$ do
    if $j \in \mathcal{A}^c$ then $\hat{u}_j = 0$
    else
        if $\frac{W_N^{(j)}(y, \hat{u}_1^{j-1} | u_j = 1)}{W_N^{(j)}(y, \hat{u}_1^{j-1} | u_j = 0)} > 1$ then $\hat{u}_j = 1$ else $\hat{u}_j = 0$



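To calculate $W_N^{(j)}(y, \hat{u}_1^{j-1} | u_j)$, Arikan's recursive channel
transformation [1] is applied. A pair of binary channels
$W_{2\Lambda}^{(2i-1)}$ and $W_{2\Lambda}^{(2i)}$ are obtained by a single-step transformation
of two independent copies of a binary input channel $W_\Lambda^{(i)}$:
$(W_\Lambda^{(i)}, W_\Lambda^{(i)}) \mapsto (W_{2\Lambda}^{(2i-1)}, W_{2\Lambda}^{(2i)})$. The channel transition
probabilities of $W_{2\Lambda}^{(2i-1)}$ and $W_{2\Lambda}^{(2i)}$ are given by

$$W_{2\Lambda}^{(2i-1)}(y_1^{2\Lambda}, u_1^{2i-2} | u_{2i-1}) = \frac{1}{2} \sum_{u_{2i}} \left[ W_\Lambda^{(i)}(y_1^{\Lambda}, u_{1,o}^{2i-2} \oplus u_{1,e}^{2i-2} | u_{2i-1} \oplus u_{2i}) \cdot W_\Lambda^{(i)}(y_{\Lambda+1}^{2\Lambda}, u_{1,e}^{2i-2} | u_{2i}) \right] \tag{2}$$

$$W_{2\Lambda}^{(2i)}(y_1^{2\Lambda}, u_1^{2i-1} | u_{2i}) = \frac{1}{2} W_\Lambda^{(i)}(y_1^{\Lambda}, u_{1,o}^{2i-2} \oplus u_{1,e}^{2i-2} | u_{2i-1} \oplus u_{2i}) \cdot W_\Lambda^{(i)}(y_{\Lambda+1}^{2\Lambda}, u_{1,e}^{2i-2} | u_{2i}) \tag{3}$$

where $0 < i \le \Lambda = 2^\lambda < N$, and $0 \le \lambda < n$.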

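Expressed in log-likelihood (LL), Eqs. (2) and (3) can be
approximated as [4]:

$$LL_{2\Lambda}^{(2i-1)}(y_1^{2\Lambda}, u_1^{2i-2} | u_{2i-1}) \approx \max\left\{ LL_\Lambda^{(i)}(y_1^{\Lambda}, u_{1,o}^{2i-2} \oplus u_{1,e}^{2i-2} | u_{2i-1} \oplus 0) + LL_\Lambda^{(i)}(y_{\Lambda+1}^{2\Lambda}, u_{1,e}^{2i-2} | 0),\ LL_\Lambda^{(i)}(y_1^{\Lambda}, u_{1,o}^{2i-2} \oplus u_{1,e}^{2i-2} | u_{2i-1} \oplus 1) + LL_\Lambda^{(i)}(y_{\Lambda+1}^{2\Lambda}, u_{1,e}^{2i-2} | 1) \right\} - \log 2, \tag{4}$$

$$LL_{2\Lambda}^{(2i)}(y_1^{2\Lambda}, u_1^{2i-1} | u_{2i}) = LL_\Lambda^{(i)}(y_1^{\Lambda}, u_{1,o}^{2i-2} \oplus u_{1,e}^{2i-2} | u_{2i-1} \oplus u_{2i}) + LL_\Lambda^{(i)}(y_{\Lambda+1}^{2\Lambda}, u_{1,e}^{2i-2} | u_{2i}) - \log 2. \tag{5}$$

To simplify the calculation, the constants in Eqs. (4) and
(5) can be discarded since this global offset for all LLs does
not affect the decoding decision.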
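
The two LL updates of Eqs. (4) and (5), with the constant $-\log 2$ dropped, can be sketched as follows. This is a minimal illustration under assumed conventions: each message is a pair of LLs indexed by the value (0 or 1) of the conditional bit, `a` belongs to the copy fed by the first half of the received word, `b` to the copy fed by the second half, and the names `ll_odd` and `ll_even` are hypothetical.

```python
def ll_odd(a, b):
    """Max-log LL pair for the odd-indexed bit u_{2i-1} (Eq. (4), constant dropped).

    a, b: LL pairs (value for bit 0, value for bit 1) of the two copies.
    """
    return tuple(max(a[u ^ 0] + b[0], a[u ^ 1] + b[1]) for u in (0, 1))


def ll_even(a, b, u_odd):
    """LL pair for the even-indexed bit u_{2i} (Eq. (5), constant dropped).

    u_odd is the already decided value of u_{2i-1}.
    """
    return tuple(a[u_odd ^ u] + b[u] for u in (0, 1))
```

Both functions use only additions and comparisons, which is why a single global offset can be ignored without changing which candidate has the larger LL.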


D. Parallel SC Algorithm for Polar Codes 


[Figure 1: (a) bits decoded one at a time (Bit 1, Bit 2, ..., Bit 6); (b) bits decoded in groups (1, 2, ..., M), (M+1, ..., 2M), (2M+1, ..., 3M).]

Fig. 1. Decoding of (a) bit-decision vs. (b) M-bit symbol-decision


The SC algorithm makes a hard decision for only one bit
at a time, as shown in Fig. 1(a). We call it the bit-decision
decoding algorithm. A parallel SC decoder [22]-[24] makes
hard decisions for $M$ bits instead of only one bit at a time, as
shown in Fig. 1(b).

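Without loss of generality, assume $M$ is a power of two,
i.e. $M = 2^m$ ($0 \le m \le n$). $\mathcal{IM}_j \triangleq \{jM - M + 1, jM - M + 2, \ldots, jM\}$,
for $0 < j \le N/M$. $\mathcal{AM}_j = \mathcal{IM}_j \cap \mathcal{A}$ and
$\mathcal{AM}_j^c = \mathcal{IM}_j \cap \mathcal{A}^c$.

Given $y$ and $\hat{u}_1^{jM-M}$, $\hat{u}_{jM-M+1}^{jM}$ is determined by

$$\hat{u}_{jM-M+1}^{jM} = \arg\max_{u_{\mathcal{AM}_j} \in \{0,1\}^{|\mathcal{AM}_j|}} W_{N,M}^{(j)}(y, \hat{u}_1^{jM-M} | u_{jM-M+1}^{jM}),$$

where $|\mathcal{AM}_j|$ represents the cardinality of $\mathcal{AM}_j$. If $M =
N$, this decoding algorithm is exactly a maximum-likelihood
sequence decoding algorithm.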
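
The symbol decision rule above amounts to an exhaustive search over the information bits of one symbol, with frozen positions held at zero. The sketch below illustrates this; `decide_symbol` and its `metric` argument are hypothetical names, and `metric` stands in for the symbol-based channel transition probability (or its LL), which is not computed here.

```python
from itertools import product


def decide_symbol(metric, M, info_positions):
    """Pick the M-bit symbol that maximizes `metric`.

    metric: hypothetical callable mapping an M-bit tuple to a likelihood
            or log-likelihood of that symbol.
    info_positions: 0-based positions inside the symbol that carry
            information bits; all other positions are frozen to zero.
    """
    best_symbol, best_value = None, None
    # Enumerate the 2**len(info_positions) candidate symbols.
    for bits in product((0, 1), repeat=len(info_positions)):
        symbol = [0] * M
        for pos, bit in zip(info_positions, bits):
            symbol[pos] = bit
        value = metric(tuple(symbol))
        if best_value is None or value > best_value:
            best_symbol, best_value = tuple(symbol), value
    return best_symbol
```

When a symbol contains no information bits, the search degenerates to the single all-zero candidate.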
Algorithm 2: SCL Decoding Algorithm [10]

$\alpha = 1$;
for $j = 1 : N$ do
    if $j \in \mathcal{A}^c$ then
        for $i = 1 : \alpha$ do
            $(\mathcal{L}_i)_j = 0$;
    else if $2\alpha \le L$ then
        for $i = 1 : \alpha$ do
            $(\mathcal{L}_i)_1^j$ = conc($(\mathcal{L}_i)_1^{j-1}$, 0);
            $(\mathcal{L}_{i+\alpha})_1^j$ = conc($(\mathcal{L}_i)_1^{j-1}$, 1);
        $\alpha = 2\alpha$;
    else
        for $i = 1 : L$ do
            S[$i$].P = $W_N^{(j)}(y, (\mathcal{L}_i)_1^{j-1} | 0)$;
            S[$i$].L = $(\mathcal{L}_i)_1^{j-1}$;
            S[$i$].U = 0;
            S[$i + L$].P = $W_N^{(j)}(y, (\mathcal{L}_i)_1^{j-1} | 1)$;
            S[$i + L$].L = $(\mathcal{L}_i)_1^{j-1}$;
            S[$i + L$].U = 1;
        sortPDecrement(S);
        for $i = 1 : L$ do
            $(\mathcal{L}_i)_1^j$ = conc(S[$i$].L, S[$i$].U);
        $\alpha = L$;
$\hat{u} = \mathcal{L}_1$;



E. SCL and CA-SCL Algorithms for Polar Codes 

Instead of making a hard decision for each information bit
of $u$ as in the SC algorithm, the SCL algorithm creates two
paths in which the information bit is assumed to be 0 and
1, respectively. If the number of paths is greater than the list
size $L$, the $L$ most reliable paths are selected. At the end of
the decoding procedure, the most reliable path is chosen as $\hat{u}$.
The SCL algorithm is formally described in Alg. 2. Without
loss of generality, we assume $L$ to be a power of two, i.e.
$L = 2^l$. We use $\mathcal{L}_i = ((\mathcal{L}_i)_1, (\mathcal{L}_i)_2, \ldots, (\mathcal{L}_i)_N)$ to represent
the $i$-th list vector, where $0 < i \le L$. S is a structure type
array with size $2L$. Each element of S has three members: P,
L, and U. The function sortPDecrement sorts the array
S by decreasing order of P. c=conc(a,b) attaches a bit
sequence b at the end of a bit sequence a, and the length of
the output bit sequence c is the sum of lengths of a and b.

The CA-SCL algorithm is used for the CRC-concatenated 
polar codes. The difference between CA-SCL ED and SCL 
algorithms is how to make the final decision for u. If there is at 
least one path satisfying the CRC constraint, the most reliable 
CRC-valid path is chosen for u. Otherwise, the decision rule 
of the SCL algorithm is used for the CA-SCL algorithm. 
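
The final decision rule of the CA-SCL algorithm can be illustrated with a short sketch. It assumes that each surviving path is available as a (reliability, bits) pair and that a CRC check is supplied as a predicate; the names `select_output` and `crc_ok` are hypothetical and the CRC computation itself is outside the scope of this example.

```python
def select_output(paths, crc_ok):
    """Choose the decoder output from the surviving list.

    paths: list of (reliability, bits) pairs, one per surviving path.
    crc_ok: hypothetical predicate returning True if `bits` passes the CRC.
    """
    # Rank paths from most to least reliable.
    ranked = sorted(paths, key=lambda p: p[0], reverse=True)
    # CA-SCL rule: the most reliable CRC-valid path wins.
    for _, bits in ranked:
        if crc_ok(bits):
            return bits
    # No path satisfies the CRC: fall back to the plain SCL rule.
    return ranked[0][1]
```

With a predicate that always returns False, the function behaves like the SCL decision rule.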

III. M-bit Symbol-Decision Decoding Algorithms
for Polar Codes 
A. M-bit Symbol-Decision SC Algorithm 

Here, we propose a symbol-decision SC algorithm, which
treats $M$-bit data as a symbol and decodes a symbol at a
time. Let $\mathcal{Z}$ represent the alphabet of all $M$-bit symbols. The
symbol-decision SC algorithm deals with the virtual channel
$\mathbf{W}_N^{(j)} : \mathcal{Z} \to \mathcal{Y}^N \times \mathcal{Z}^{j-1}$ for $0 < j \le N/M$ with the transition
probabilities $\mathbf{W}_N^{(j)}(y_1^N, z_1^{j-1} | z_j)$, where $(y_1^N, z_1^{j-1})$ and
$z_j = (u_{jM-M+1}, \ldots, u_{jM})$ denote the output and input of
$\mathbf{W}_N^{(j)}$, respectively. Actually, $\mathbf{W}_N^{(j)}$ is exactly equivalent to
$W_{N,M}^{(j)}$ if we consider $\mathcal{X}^M$ as the binary vector representation
of $\mathcal{Z}$. Therefore, the symbol-decision SC algorithm has the
same schedule as the parallel SC algorithm in [22]-[24].
However, our symbol-decision SC algorithm has a different
approach, called symbol-based recursive channel combination,
to compute symbol-based channel transition probabilities
$W_{N,M}^{(j)}(y, \hat{u}_1^{jM-M} | u_{jM-M+1}^{jM})$, which is our main focus.


B. Symbol-Based Recursive Channel Combination 

In [22]-[24], the calculation of the symbol-based channel
transition probability $W_{N,M}^{(i)}(y, \hat{u}_1^{iM-M} | u_{iM-M+1}^{iM})$ is based on the following
equation, referred to as direct-mapping calculation:

$$W_{N,M}^{(i)}(y, \hat{u}_1^{iM-M} | u_{iM-M+1}^{iM}) = \prod_{j=0}^{M-1} W_{N/M}^{(i)}\left(y_{jN/M+1}^{(j+1)N/M}, w_{jN/M+1}^{jN/M+i-1} \,\middle|\, w_{jN/M+i}\right), \tag{7}$$

where $(w_i, w_{i+N/M}, \ldots, w_{i+N-N/M}) = u_{iM-M+1}^{iM} B_M F^{\otimes m}$ and
$W_{N/M}^{(i)}(y_{jN/M+1}^{(j+1)N/M}, w_{jN/M+1}^{jN/M+i-1} | w_{jN/M+i})$ is calculated by
Arikan's recursive channel transformations.

Actually, a symbol-based recursive channel combination
described in Proposition 1 can be used to calculate
$W_{N,M}^{(i)}(y, \hat{u}_1^{iM-M} | u_{iM-M+1}^{iM})$.


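Proposition 1. Assume that all bits of $u$ are independent and
each bit has an equal probability of being a 0 or 1. Given
$0 < m \le n$, $N = 2^n$, $M = 2^m$, for any $1 \le \phi \le m$,
$0 \le \lambda < n$, $\Lambda = 2^\lambda$, $\Phi = 2^\phi$, and $0 \le i < 2\Lambda/\Phi$, we say that a
$\Phi$-bit channel $W_{2\Lambda,\Phi}^{(i+1)}$ is obtained by a single-step combination
of two independent copies of a $\Phi/2$-bit channel $W_{\Lambda,\Phi/2}^{(i+1)}$ and
write

$$(W_{\Lambda,\Phi/2}^{(i+1)}, W_{\Lambda,\Phi/2}^{(i+1)}) \mapsto W_{2\Lambda,\Phi}^{(i+1)}, \tag{8}$$

where the channel transition probability satisfies

$$W_{2\Lambda,\Phi}^{(i+1)}(y_1^{2\Lambda}, u_1^{i\Phi} | u_{i\Phi+1}^{i\Phi+\Phi}) = W_{\Lambda,\Phi/2}^{(i+1)}(y_1^{\Lambda}, u_{1,o}^{i\Phi} \oplus u_{1,e}^{i\Phi} | u_{i\Phi+1,o}^{i\Phi+\Phi} \oplus u_{i\Phi+1,e}^{i\Phi+\Phi}) \cdot W_{\Lambda,\Phi/2}^{(i+1)}(y_{\Lambda+1}^{2\Lambda}, u_{1,e}^{i\Phi} | u_{i\Phi+1,e}^{i\Phi+\Phi}). \tag{9}$$
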


Similar to the SC algorithm, with the help of the symbol-based
recursive channel combination, an $M$-bit symbol-decision
SC algorithm can be represented by using a message
flow graph (MFG) as well, where a channel transition probability
is referred to as a message for the sake of convenience.
This MFG is referred to as SR-MFG. If the code length
of a polar code is $N$, the SR-MFG can be divided into
$(n + 1)$ stages $(S_0, S_1, \ldots, S_n)$ from the right to the left:
one initial stage $S_0$ and $n$ calculation stages. For the SC
algorithm, all calculation stages carry out Arikan's recursive
channel transformation. However, for the $M$-bit symbol-decision
SC algorithm, in the left-most $m$ calculation stages
$(S_n, \ldots, S_{n-m+1})$, called S-COMBS stages, symbol-based
channel combinations are carried out. For the remaining $(n - m)$
calculation stages $(S_{n-m}, \ldots, S_1)$, called B-TRANS stages,
Arikan's recursive channel transformations are performed.
The S-COMBS stages use outputs of B-TRANS stages to
calculate symbol-based messages.

For [22]-[24], we refer to the MFG as the DM-MFG which
also consists of two parts: B-TRANS and DM-CAL. The B- 
TRANS part of the DM-MFG is the same as that of the SR- 
MFG. However, there is only one stage in the DM-CAL part of 
the DM-MFG which performs the direct-mapping calculation. 

For example, as shown in Fig. 2, the SR-MFG of a four-bit
symbol-decision SC algorithm for a polar code with $N = 8$
has four stages. Messages of the initial stage ($S_0$) come
from the channel directly. Messages of the first stage ($S_1$)
are calculated with Arikan's transformations. Messages of
the second and third stages ($S_2$ and $S_3$) are calculated with
Eq. (9). Stages in the left gray box are the S-COMBS stages.
Stages in the right gray box are the B-TRANS stages. Fig. 3
shows the DM-MFG when the direct-mapping calculation is
used to calculate the symbol-based channel transition probabilities
$W_{8,4}^{(1)}(y_1^8 | u_1^4)$ and $W_{8,4}^{(2)}(y_1^8, u_1^4 | u_5^8)$. Here,

$v_1^4 = u_{1,o}^8 \oplus u_{1,e}^8$, $v_5^8 = u_{1,e}^8$,

$w_1 = v_1 \oplus v_2 = u_1 \oplus u_2 \oplus u_3 \oplus u_4$,

$w_2 = v_3 \oplus v_4 = u_5 \oplus u_6 \oplus u_7 \oplus u_8$,

$w_3 = v_2 = u_3 \oplus u_4$,

$w_4 = v_4 = u_7 \oplus u_8$,

$w_5 = v_5 \oplus v_6 = u_2 \oplus u_4$,

$w_6 = v_7 \oplus v_8 = u_6 \oplus u_8$,

$w_7 = v_6 = u_4$,

$w_8 = v_8 = u_8$.




Fig. 2. The message flow graph of a four-bit symbol-decision SC algorithm
for a polar code with a code length of eight by using the proposed symbol-based
recursive channel combination.

Fig. 3. The message flow graph of a four-bit symbol-decision SC algorithm
for a polar code with a code length of eight by using direct-mapping
calculation [22]-[24].

For the direct-mapping calculation, Eq. (7) needs $(M - 1)$
additions for each LL message. Therefore, a total of $2^{|\mathcal{AM}_j|}(M - 1)$ additions
are needed to calculate all LL-based symbol-based channel
transition probabilities for $\hat{u}_{jM-M+1}^{jM}$. Consider the recursive
symbol-based channel combination. The S-COMBS stages of
the SR-MFG are indexed as 1 to $m$ from left to right. There
are $2^{i-1}$ ($0 < i \le m$) nodes in the $i$-th S-COMBS stage and
each node contains at most $2^{M/2^{i-1}}$ messages. One addition is needed
to compute each LL message according to Eq. (9). Hence, the
number of additions needed by the S-COMBS stages to calculate
$W_{N,M}^{(j)}(y, \hat{u}_1^{jM-M} | u_{jM-M+1}^{jM})$ is at most
$\sum_{i=1}^{m} 2^{i-1} \cdot 2^{M/2^{i-1}}$.

Actually, if we perform the hardware implementation, the
worst case - that all bits of a symbol are information bits
- should be considered. Therefore, the recursive symbol-based
channel combination can be taken advantage of to reduce
the complexity of calculating the symbol-based channel transition
probability.

For the example shown in Figs. 2 and 3, Eq. (7) needs $2^4 (4 -
1) = 48$ additions to calculate $\log(W_{8,4}^{(1)}(y_1^8 | u_1^4))$. With the
symbol-based channel combination, 4, 4 and 16 additions
are needed to calculate $\log(W_{4,2}^{(1)}(y_1^4 | v_1^2))$, $\log(W_{4,2}^{(1)}(y_5^8 | v_5^6))$
and $\log(W_{8,4}^{(1)}(y_1^8 | u_1^4))$, respectively. Therefore, our method
needs only $2^4 + 2 \times 2^2 = 24$ additions, which is only
half of those needed by Eq. (7). Table I lists the numbers
of additions needed by our recursive method and the direct-mapping
calculation [22]-[24] when all $M$ bits of a symbol
are information bits. When $M = 8$, the number of additions
needed by our proposed method is 17% of that needed by the
direct-mapping calculation.


TABLE I

The numbers of additions to calculate $W_{N,M}^{(j+1)}(y, \hat{u}_1^{jM} | u_{jM+1}^{jM+M})$
when the $(j + 1)$-th symbol has no frozen bit.

           Proposed method    Direct-mapping calculation [22]-[24]
M = 2      4                  4
M = 4      24                 48
M = 8      304                1792


The other advantage of the proposed method to calculate the
symbol-based channel transition probability is that it reveals
the similarity between Arikan's recursive channel transformation
and the symbol-based recursive channel combination. We
will take advantage of this similarity to reuse adders and to
save area when computing the bit- and symbol-based channel
transition probability in our proposed architecture. In [24],
additional dedicated adders are used to calculate the symbol-based
channel transition probability, which is not area efficient.
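
The addition counts quoted for the two methods when a symbol has no frozen bit can be checked numerically. The sketch below assumes, as an interpretation of the stage description, that the $i$-th S-COMBS stage has $2^{i-1}$ nodes with $2^{M/2^{i-1}}$ messages each and one addition per message, and that the direct mapping uses $(M - 1)$ additions for each of the $2^M$ candidates. The function names are hypothetical.

```python
def additions_direct(M):
    """Direct-mapping count: 2**M candidate symbols, (M - 1) additions each."""
    return 2 ** M * (M - 1)


def additions_recursive(M):
    """Recursive-combination count summed over the m = log2(M) S-COMBS stages.

    Stage i (1-based, left to right) is assumed to have 2**(i-1) nodes,
    each holding 2**(M / 2**(i-1)) messages, one addition per message.
    """
    m = M.bit_length() - 1
    return sum(2 ** (i - 1) * 2 ** (M >> (i - 1)) for i in range(1, m + 1))


for M in (2, 4, 8):
    print(M, additions_recursive(M), additions_direct(M))
```

Under these assumptions the ratio for $M = 8$ is $304 / 1792 \approx 0.17$.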

In terms of the error performance, the symbol-decision SC
algorithm is not worse than the bit-decision SC algorithm
[27]. Fig. 4 shows the BERs and FERs of symbol-decision
SC algorithms for a (1024, 512) polar code. SDSC-$i$ denotes
the $i$-bit symbol-decision SC algorithm. When $M = 2$ and 4,
the FER performance is the same as that of the bit-decision
SC algorithm. When $M = 8$, the FER performance is slightly
better.



Fig. 4. Error rates of symbol-decision SC algorithms for a (1024, 512) polar 
code. 


Algorithm 3: M-bit Symbol-Decision SCL Decoding
Algorithm

$\alpha = 1$;
for $j = 1 : N/M$ do
    $\beta = 2^{|\mathcal{AM}_j|}$;
    if $\beta == 1$ then
        for $i = 1 : \alpha$ do
            $(\mathcal{L}_i)_{jM-M+1}^{jM} = 0$;
    else if $\alpha\beta \le L$ then
        $u_{\mathcal{AM}_j^c} = 0$;
        for $k = 0 : \beta - 1$ do
            $u_{\mathcal{AM}_j}$ = dec2bin($k$, $|\mathcal{AM}_j|$);
            for $i = 1 : \alpha$ do
                $t = i + k\alpha$;
                $(\mathcal{L}_t)_1^{jM}$ = conc($(\mathcal{L}_i)_1^{jM-M}$, $u_{jM-M+1}^{jM}$);
        $\alpha = \alpha\beta$;
    else
        $u_{\mathcal{AM}_j^c} = 0$;
        for $k = 0 : \beta - 1$ do
            $u_{\mathcal{AM}_j}$ = dec2bin($k$, $|\mathcal{AM}_j|$);
            for $i = 1 : L$ do
                $t = i + kL$;
                S[$t$].P = $W_{N,M}^{(j)}(y, (\mathcal{L}_i)_1^{jM-M} | u_{jM-M+1}^{jM})$;
                S[$t$].L = $(\mathcal{L}_i)_1^{jM-M}$;
                S[$t$].U = $u_{jM-M+1}^{jM}$;
        sortPDecrement(S);
        for $i = 1 : L$ do
            $(\mathcal{L}_i)_1^{jM}$ = conc(S[$i$].L, S[$i$].U);
        $\alpha = L$;


C. Generalized Symbol-Decision SCL Decoding Algorithm

Similarly, the symbol-based recursive channel combination
is also useful for the SCL algorithm. The symbol-decision
SCL algorithm is more complicated than the SCL algorithm,
since the path expansion coefficient is no longer a constant.
In the SCL algorithm, for each information bit, the
path expansion coefficient is two. But for the $M$-bit symbol-decision
SCL algorithm, the path expansion coefficient is
$2^{|\mathcal{AM}_j|}$, which depends on the number of information bits in
an $M$-bit symbol. The $M$-bit symbol-decision SCL algorithm
is formally described in Alg. 3. Without any ambiguity, 0
represents a zero vector whose bit-width is determined by the
left-hand operator. The function dec2bin(d,b) converts a
decimal number d to a b-bit binary vector. Eq. (9) is used
to calculate the symbol-based channel transition probability
corresponding to each list, i.e. $W_{N,M}^{(j)}(y, (\mathcal{L}_i)_1^{jM-M} | u_{jM-M+1}^{jM})$.

Fig. 5 shows the BERs and FERs of symbol-decision
SCL algorithms for a (1024, 480) CRC32-concatenated polar
code with $L = 4$, where the generator polynomial of the
CRC32 is 0x1EDC6F41. This CRC32 is also used in all the
CRC-concatenated polar codes used in the following section.
SDSCL-$i$ denotes the $i$-bit symbol-decision SCL algorithm.
The performances of the symbol-decision SCL algorithms with
different symbol sizes are almost the same.


Fig. 5. Error rates of symbol-decision SCL algorithms for a (1024, 480)
CRC32-concatenated polar code with L = 4.
D. Two-Stage List Pruning Network for Symbol-Decision SCL
algorithm

For the $M$-bit symbol-decision SCL algorithm, the maximum
path expansion coefficient is $2^M$, i.e. each existing path
generates $2^M$ paths. Therefore, in the worst-case scenario, the
$L$ most reliable paths should be selected out of $2^M L$ paths.
To facilitate this sorting network, we propose a two-stage list
pruning network. In the first stage, the $q$ most reliable paths
are selected from up to $2^M$ paths that come from the expansion of
each existing path. Therefore, there are $qL$ paths left. In the
second stage, the $L$ most reliable paths are sorted out from
the $qL$ paths generated by the first stage. The message flow of
a two-stage list pruning network is illustrated in Fig. 6.



Fig. 6. Message flow for a two-stage list pruning network. 


If $q \ge L$, the $L$ paths found by the two-stage list pruning
network are exactly the $L$ most reliable paths among the $2^M L$
paths. When $q < L$, this is no longer guaranteed, and the
probability that the $L$ paths found by the two-stage list pruning
network are exactly the $L$ most reliable paths among the $2^M L$
paths decreases as $q$ decreases. This
may cause some performance loss. But a smaller $q$ leads to a
two-stage list pruning network with lower complexity.



Fig. 7. Error rates of the SDSCL-8 decoder for a (1024, 480) CRC32- 
concatenated polar code with L = 4. 


Fig. 7 shows how different values of $q$ affect the error
performance of an SDSCL-8 algorithm for a (1024, 480)
CRC32-concatenated polar code with $L = 4$. When $L = 4$ and
$q = 2$, the SDSCL-8 algorithm shows an FER performance
loss of about 0.25 dB at an FER level of $10^{-3}$. As shown
in Fig. 8, for a (2048, 1401) CRC32-concatenated polar code,
the two-stage list pruning network with $q = 4$ helps to reduce
the complexity of the SDSCL-4 decoder without observed
performance loss when $L = 8$. When $q = 2$ and $L = 8$,
the SDSCL-4 decoder has a performance degradation of about
0.1 dB at an FER level of $10^{-3}$, compared with the SDSCL-4
decoder with $q = 8$ and $L = 8$. If $L = 4$, the error performance
loss due to $q = 2$ is very small.


Fig. 8. Error rates of the SDSCL-4 algorithm for a (2048, 1401) CRC32-concatenated
polar code with L = 4 and L = 8.

Therefore, the two-stage list pruning network uses an additional
parameter $q$ to introduce different trade-offs between
error performance and complexity.
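
The two pruning stages can be sketched in a few lines. This is an illustrative software model only, not the hardware sorting network: it assumes each existing path supplies a list of (metric, candidate) pairs for its expansions, and the name `two_stage_prune` is hypothetical.

```python
import heapq


def two_stage_prune(expansions, L, q):
    """Select L paths with a two-stage pruning.

    expansions: one list per existing path, each holding (metric, candidate)
                pairs for that path's expansions (larger metric = more reliable).
    """
    survivors = []
    for candidates in expansions:
        # Stage 1: keep the q most reliable expansions of each existing path.
        survivors.extend(heapq.nlargest(q, candidates, key=lambda c: c[0]))
    # Stage 2: keep the L most reliable among the (at most) q * L survivors.
    return heapq.nlargest(L, survivors, key=lambda c: c[0])
```

In the small test case used to check this sketch, choosing `q` equal to `L` returns the same paths as a full sort, while `q = 1` drops a path that a full sort would have kept, which mirrors the trade-off described above.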

IV. Pre-Computation Memory-Saving Technique 

The pre-computation technique was first proposed in [25] and
can be used to improve the processing rate when the number
of possible outputs is finite. The pre-computation
technique has also been used to improve the throughput of the line SC
decoder at the cost of increased area. Here, our
main purpose is to use the pre-computation technique to reduce
the memory required by list decoders because the memory of
an SCL decoder to store the channel transition probability
becomes a big challenge as the list size and code length
increase. Henceforth, this memory saving technique is called
the pre-computation memory-saving (PCMS) technique. It is
worth noting that this memory-saving technique is independent
of the decoder architecture and the message representation of
SCL decoders.

Let us take the MFG shown in Fig. 2 as an example.
For stages $S_0$ and $S_1$, the numbers of pairs of LLs stored
by the list decoder are 8 and $4L$, respectively. Actually, the
outgoing message $W_2^{(1)}(y_1^2 | w_1)$ of the top black node in $S_1$
can only be either $W_2^{(1)}(y_1^2 | 0)$ or $W_2^{(1)}(y_1^2 | 1)$. The outgoing
message $W_2^{(2)}(y_1^2, w_1 | w_2)$ can only be one of $W_2^{(2)}(y_1^2, 0 | 0)$,
$W_2^{(2)}(y_1^2, 0 | 1)$, $W_2^{(2)}(y_1^2, 1 | 0)$, and $W_2^{(2)}(y_1^2, 1 | 1)$. Hence, no
matter what the list size is, the total number of possible values
of outgoing messages of $S_1$ is $2 \times 4 + 4 \times 4 = 24$. These
24 values provide all information we need for calculations of
further stages. With knowledge of these 24 values, channel
LLs are not needed any more.

Generally speaking, the PCMS technique takes advantage of
the relationship between messages of $S_0$ (channel LLs) and
outgoing messages of $S_1$. By storing only all possible outgoing
messages of $S_1$, the PCMS technique helps list decoders save
memory.
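The idea can be illustrated with a small Python sketch that tabulates every possible outgoing message of $S_1$ once, independently of the list size. It is a sketch under stated assumptions: plain probabilities are used instead of quantized LLs, adjacent channel outputs are paired as suggested by the $y_1^2$ notation, the standard single-step polar transforms are used for the two node types, and the name `precompute_s1` is hypothetical.

```python
def precompute_s1(channel):
    """Tabulate all possible S_1 messages from the channel messages.

    channel[k] = (W(y_k | 0), W(y_k | 1)) for k = 0 .. N-1 (probabilities).
    Returns one (f, g) table per pair of channel outputs, where
      g[(u1, u2)] = 1/2 * W(y_a | u1 ^ u2) * W(y_b | u2)   (bottom node)
      f[u1]       = g[(u1, 0)] + g[(u1, 1)]                (top node)
    """
    tables = []
    for k in range(0, len(channel), 2):
        wa, wb = channel[k], channel[k + 1]
        # Four possible bottom-node messages, indexed by (partial sum, bit).
        g = {(u1, u2): 0.5 * wa[u1 ^ u2] * wb[u2]
             for u1 in (0, 1) for u2 in (0, 1)}
        # Two possible top-node messages, indexed by the conditional bit.
        f = {u1: g[(u1, 0)] + g[(u1, 1)] for u1 in (0, 1)}
        tables.append((f, g))
    return tables
```

For N = 8 the tables hold 4 x (2 + 4) = 24 values in total. During decoding, every path only needs to look up `f[u]` or `g[(w, u)]` with its own partial sum `w`, so neither the channel messages nor per-path copies of the $S_1$ messages have to be kept.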

Let us evaluate the memory saving of the PCMS technique,
assuming the LL representation is used for the channel transition
probability. Without the PCMS technique, a list decoder with a
list size of $L$ for a polar code of length $N$ stores
$(N - 2)L + N$ LL pairs. Each pair contains two messages,
which are associated with the conditional bit being zero or
one. The total number of bits used for LL storage is

$$
\begin{aligned}
B_{LL} &= 2\Big(N Q_{ch} + L \sum_{i=1}^{\log N - 1} 2^i (Q_{ch} + \log N - i)\Big) \\
&= 2(L + 1) N Q_{ch} + 4L(N - \log N - Q_{ch} - 1),
\end{aligned}
$$

where $Q_{ch}$ denotes the number of bits used for the quantization
of the channel LLs.

With the PCMS technique, the total number of LL pairs
needed by a list decoder is $\frac{N-4}{2}L + \frac{3}{2}N$. The total number of
bits needed for LL storage is:

$$
\begin{aligned}
B_{PCMS} &= 2\Big(\frac{N}{2} + N\Big)(Q_{ch} + 1) + 2L \sum_{i=1}^{\log N - 2} 2^i (Q_{ch} + \log N - i) \\
&= 3N(Q_{ch} + 1) + LN(Q_{ch} + 3) - 4L(\log N + Q_{ch} + 1) \\
&= B_{LL} - N(LQ_{ch} + L - Q_{ch} - 3).
\end{aligned}
\tag{11}
$$

Therefore, when the LL representation is used for messages,
the PCMS technique saves $N(LQ_{ch} + L - Q_{ch} - 3)$ bits of
memory. The saving is linear in both $N$ and $L$. Consider
a polar code with N = 1024 and a list decoder with L = 4 and
$Q_{ch} = 4$. Without the PCMS technique, $B_{LL} = 57104$. With
the PCMS technique, $B_{PCMS} = 43792$. The PCMS technique
saves 13312 bits of memory, which is 23% of $B_{LL}$.
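The storage formulas can be checked numerically. The short Python sketch below evaluates the two summations directly and compares their difference with the closed-form saving; the function names are illustrative, and N is assumed to be a power of two.

```python
def ll_bits(N, L, Q):
    """Bits of LL storage without PCMS (summation form of B_LL)."""
    n = N.bit_length() - 1  # log2(N) for N a power of two
    return 2 * (N * Q + L * sum(2**i * (Q + n - i) for i in range(1, n)))


def pcms_bits(N, L, Q):
    """Bits of LL storage with PCMS (summation form of B_PCMS)."""
    n = N.bit_length() - 1
    stored_s1 = 2 * (N // 2 + N) * (Q + 1)  # all possible S_1 messages
    upper = 2 * L * sum(2**i * (Q + n - i) for i in range(1, n - 1))
    return stored_s1 + upper


def pcms_saving(N, L, Q):
    """Closed-form saving N(LQ + L - Q - 3)."""
    return N * (L * Q + L - Q - 3)
```

For N = 1024, L = 4 and a 4-bit channel quantization, these functions give 57104 and 43792 bits, and the difference of 13312 bits agrees with the closed-form expression.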

The other advantage of the PCMS technique is that it
improves the throughput slightly because the messages of $S_1$
are already in the memory and do not need to be calculated
from the channel messages. For example, for a bit-decision
semi-parallel SCL decoder with a list size of $L$, if the code
length is $N$ and the number of processing units is $P$, the
latency saving due to the PCMS technique is $\frac{NL}{P}$ clock cycles.

V. Implementation of Symbol-Decision SCL 
Decoders 

A. Architecture of Symbol-Decision SCL Decoders 

We propose an architecture of an M-bit symbol-decision
SCL decoder, shown in Fig. 9. It consists of M MPU blocks
(MPU$_0$, MPU$_1$, $\cdots$, MPU$_{M-1}$), a list pruning network (LPN),
a mask bit generator (MBG), a message-screening block
(MSNG), a control block (CNTL), an output-list generator
(OLG) and a CRC checker (CRCC).



Fig. 9. Top architecture for an M-bit symbol-decision SCL decoder. 


An MPU block calculates messages for the B-TRANS and
S-COMBS stages and updates the partial-sum network by
adopting blocks of the SCL decoder in [20]. The additions of
the S-COMBS stages are carried out by reusing the hardware
resources that calculate messages of the B-TRANS
stages, which reduces the area. Compared with the SCL decoder
in [20], the MPU has neither a path pruning unit nor a CRC
checker. The other improvement for the MPU is that the PCMS
technique is used. The architecture of an MPU is shown
in Fig. 10. Channel messages are not needed any more due
to the adoption of the PCMS technique. L-MEM stores messages
corresponding to stages of the MFG. For the stage $S_1$, MSEL
selects the appropriate messages from L-MEM based on
partial-sum values and/or the type of calculation nodes. PUs are
processing units that calculate LL messages. PSUs are used to
update partial sums. ISel selects messages from L-MEM or the
OSel module for the crossbar (CB) module, which chooses
proper messages for the PUs. OSel outputs messages to L-MEM
for intermediate stages and outputs symbol-based messages to
MSNG.



Fig. 10. Architecture of an MPU. 


We take the MFG of Fig. 2 as an example to illustrate
the function of block MSEL. For node $f_{2,1}$ of path $l$,
the stored pairs of $S_1$ messages feeding this node
are selected from L-MEM by MSEL and output to ISel.
For node $g_{2,1}$ of path $l$,
$\{W_2^{(2)}(y_1^2, w_{1l}|0), W_2^{(2)}(y_1^2, w_{1l}|1)\}_l$
and $\{W_2^{(2)}(y_3^4, w_{3l}|0), W_2^{(2)}(y_3^4, w_{3l}|1)\}_l$ are selected from
L-MEM. Here, $w_{1l}$ and $w_{3l}$ are the partial sums for $w_1$ and $w_3$,
respectively, belonging to path $l$. The detailed information of
other blocks in Fig. 10 can be found in [20] and will not be
discussed in this paper.

The message-passing scheme in the MFG of a polar code is
serial, which means that the calculation of a stage
depends on the output of its previous stage. The PUs in [20]
only carry out the B-TRANS additions. On the other hand,
the S-COMBS stages need only additions and a processing
unit has four adders. Therefore, in order to save hardware
resources, the adders in the processing units are reused to
calculate the symbol-based channel transition probability after
these processing units finish calculations for the B-TRANS
stages. In other words, additions of both the B-TRANS and
the S-COMBS stages are folded onto the same adders in the
processing units. As shown in Fig. 11, c[0] and c[1] are
outputs for the B-TRANS stages; d[0], d[1], d[2], and d[3] are
outputs for the S-COMBS stages.



Fig. 11. Architecture of a processing unit.

Block MBG provides a mask bit for each path. If there
are $f$ ($f \geq 0$) frozen bits in the M-bit symbol, the number
of expanded paths will be $2^{M-f}$. For hardware implementations,
we need to consider the worst case, and all messages
corresponding to $2^M$ possible paths are calculated. Each path
is associated with a mask bit. When some paths are not
needed, due to frozen bits, they are turned off by mask bits.
Fig. 12 shows how to generate the mask bit for path $i$, where
$i = (i_1, i_2, \cdots, i_M) \in \{1, 0\}^M$ ($0 \leq i \leq 2^M - 1$) and
$b_j = (b_{j,1}, b_{j,2}, \cdots, b_{j,M})$ is a frozen-bit indication vector
for $u_{jM+1}^{jM+M}$. If $u_{jM+t}$ is a frozen bit, $b_{j,t} = 1$. Otherwise,
$b_{j,t} = 0$. If $b_j$ is an all-one vector, all bits of $u_{jM+1}^{jM+M}$ are
frozen bits, called an M-bit frozen vector. If $\text{Mask\_bit}_i$ is 1,
$u_{jM+1}^{jM+M}$ is impossible to be $i$ and the message corresponding
to $u_{jM+1}^{jM+M} = i$ is set to 0 in block MSNG.

Fig. 12. Architecture for generating a mask bit.

Block LPN receives $2^M L$ messages from block MSNG,
finds the most reliable $L$ paths, and feeds decision results
back to the MPUs. Here, we use two different sorting
implementations, a folded sorting implementation and a tree
sorting implementation, for different designs. The basic unit
for these two implementations is a bitonic sorter [28], which
outputs the $L$ max values out of $2L$ inputs. It is referred
to as BS_L. The folded sorting implementation needs $2^{M-1}$
BS_Ls (BS_L$_0$, BS_L$_1$, $\cdots$, BS_L$_{2^{M-1}-1}$). The outputs of
BS_L$_{2i}$ and BS_L$_{2i+1}$ ($0 \leq i < 2^{M-2}$) are connected
with inputs of BS_L$_i$ through registers and multiplexers. For
the tree sorting implementation with $2^M L$ inputs, $2^M - 1$
BS_Ls are needed. The tree sorting implementation can be
divided into $M$ layers. For $0 \leq i < M$, there are $2^i$ BS_Ls
in the $i$-th layer. Inputs of the BS_Ls of the $i$-th layer are
connected with outputs of the BS_Ls of the $(i + 1)$-th layer.

Fig. 13 and Fig. 14 show examples of the folded and tree sorting
implementations, respectively, for $2^M = 8$.

Fig. 13. Architecture for the folded sorting implementation when $2^M = 8$.

Fig. 14. Architecture for the tree sorting implementation when $2^M = 8$.

The folded sorting implementation has a smaller area than
the tree sorting implementation. However, pipelining can
be applied to the tree sorting implementation by inserting
registers between layers to improve the throughput of the tree
sorting implementation.

For the two-stage list pruning network proposed in
Sec. III-D, either the folded sorting implementation or the tree
sorting implementation can be used for the $2^M$-to-$q$ sorting
function and the $qL$-to-$L$ sorting function.
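As a behavioral illustration of the tree sorting implementation, the Python sketch below builds the selection out of BS_L-like units and counts the units and layers it uses. It is only a functional model under stated assumptions: the basic unit is modelled with Python's `sorted` rather than a bitonic sorting network, registers and timing are ignored, and the names `bs_l` and `tree_sort` are illustrative.

```python
def bs_l(a, b, L):
    """Basic unit: return the L largest of the 2L inputs a + b.

    Hardware would use a bitonic sorter; sorted() stands in for it here.
    """
    return sorted(a + b, reverse=True)[:L]


def tree_sort(messages, L):
    """Select the L largest of 2^M * L messages with a tree of basic units.

    Returns (selected values, number of basic units, number of layers).
    """
    # Split the inputs into 2^M groups of L values each.
    groups = [messages[i:i + L] for i in range(0, len(messages), L)]
    units = layers = 0
    while len(groups) > 1:
        # One layer: every pair of groups feeds one basic unit.
        groups = [bs_l(groups[i], groups[i + 1], L)
                  for i in range(0, len(groups), 2)]
        units += len(groups)
        layers += 1
    return groups[0], units, layers
```

With 32 inputs and L = 4, that is $2^M = 8$, the model uses 4 + 2 + 1 = 7 basic units in 3 layers, in line with the $2^M - 1$ units and $M$ layers stated above. A folded implementation would instead time-share the $2^{M-1}$ units of the first layer over successive passes.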

Block CNTL provides control signals to schedule the
hardware sharing for MPUs and decides when to start pruning
paths. The signal frz_flag is an indicator which is one when
a frozen vector appears. When frz_flag is one, all MPUs use
zero to update the partial sums instead of outputs of the LPN.
In this case, the LPN, the MSNG, and the calculation of
S-COMBS stages are bypassed. The OLG stores the output
paths. The CRCC checks if a path satisfies the CRC constraint.
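The mask-bit and frz_flag behavior can be sketched in a few lines of Python. Since the circuit of Fig. 12 is not reproduced here, the logic below is an assumption: frozen bits are taken to be zero, so a candidate symbol is masked whenever it has a one in a frozen position. The function names are illustrative.

```python
from itertools import product


def mask_bit(i_bits, b):
    """Return 1 if candidate symbol i_bits is impossible given b.

    b[t] = 1 marks a frozen position. Assumption: frozen bits are zero,
    so a one in any frozen position rules the candidate out.
    """
    return int(any(bt and it for bt, it in zip(b, i_bits)))


def frz_flag(b):
    """Return 1 if all M bits of the symbol are frozen (a frozen vector)."""
    return int(all(b))


def active_symbols(b):
    """List the candidate symbols that are not masked."""
    return [i for i in product((0, 1), repeat=len(b)) if not mask_bit(i, b)]
```

With two frozen positions in a 4-bit symbol, `active_symbols` returns $2^{4-2} = 4$ candidates. For an all-one indication vector only the all-zero symbol survives and `frz_flag` is one, which corresponds to the case in which the LPN, the MSNG and the S-COMBS calculations are bypassed.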

B. Message Scheduling and Latency Analysis 

To improve area efficiency, different scheduling schemes
are needed for different numbers of PUs. To reuse the adders
of the processing units, the additions of the S-COMBS stages
in the MFG must be scheduled properly. Assume the number
of processing units is $P$. The total number of adders
provided by the processing units is $4P$. If $2^M L \leq 4P$, we use a
serial scheduling, which means that there is no overlap between
the processing units and the LPN in terms of operation time,
as shown in Fig. 15.



Fig. 15. Serial scheduling (in clock cycles).

Suppose each addition takes one clock cycle. Then each
S-COMBS stage takes one clock cycle to compute messages.
Therefore, it takes $m$ clock cycles for the S-COMBS stages
to output messages to the LPN. To save area, the folded
sorting implementation is applied for the serial scheduling.

When $2^M L > 4P \geq 2^{M/2} L$, there are not enough
adders to calculate all $2^M L$ messages of the stage $S_n$ in
one clock cycle, but all $2^{M/2} L$ messages of the stage $S_i$
($n - m + 1 \leq i \leq n - 1$) can be calculated in one clock cycle.
Without increasing the number of adders, $\lceil \frac{2^M L}{4P} \rceil$ cycles are
needed. In each cycle, $4P$ messages are calculated. To reduce
the latency, the overlapping scheduling shown in Fig. 16 is
used. In clock cycle $c_0$, the first $4P$ messages come out. In
clock cycle $c_1$, the LPN starts work. Therefore, the MPUs
and the LPN are working simultaneously for $\lceil \frac{2^M L}{4P} \rceil - 1$ clock
cycles. Here, the LPN works in a pipelined way. Hence, the
tree sorting implementation is deployed for the overlapping
scheduling and a BS_L is connected at the end of the tree
sorting implementation in the way shown in Fig. 17, where
the number on a line represents the number of messages
transmitted through the line.

Fig. 16. Overlapping scheduling (in clock cycles).

Fig. 17. A pipelined tree sorting implementation for the overlapping
scheduling.

The latency of an M-bit symbol-decision SCL decoder
consists of: the latency for calculating messages of the
B-TRANS stages, the latency for calculating messages of the
S-COMBS stages, and the latency of the list pruning network.
$T_B$ represents the overall number of clock cycles for the
calculations of the B-TRANS stages. It is equivalent to the
latency of a bit-decision SCL decoder with a code length of
$\frac{N}{M}$ and $\frac{P}{M}$ processing units:

$$
T_B = 2\frac{N}{M} + \frac{NL/M}{P/M} \log_2 \frac{NL/M}{4P/M} - \frac{NL/M}{P/M},
$$


where the third term, $-\frac{NL/M}{P/M}$, is the latency saving by using
the PCMS technique. $T_S$ represents the number of clock cycles
for the calculations of S-COMBS stages per symbol. $T_N$
represents the number of extra clock cycles per symbol needed
by the LPN to finish the list pruning after all messages of the
stage $S_n$ are calculated. If $2^M L \leq 4P$, the number of clock
cycles used to calculate messages for S-COMBS stages is
$T_S = m$. When $2^M L > 4P \geq 2^{M/2} L$, $T_S = m - 1 + \lceil \frac{2^M L}{4P} \rceil$.
More generally, $T_S$ is upper-bounded by the sum, over the $m$
S-COMBS stages, of the number of clock cycles each stage needs
with $4P$ adders. $T_N$ is determined by the
detailed implementation. Hence, the latency of the
symbol-decision SCL decoder is:

$$
\begin{aligned}
T(M) &= (1 - \gamma)\frac{N}{M}(T_S + T_N) + T_B \\
&= (1 - \gamma)\frac{N}{M}(T_S + T_N) + 2\frac{N}{M} + \frac{NL}{P}\log_2\Big(\frac{NL}{8P}\Big),
\end{aligned}
\tag{12}
$$

where $\gamma$ is the ratio of the number of frozen vectors to $\frac{N}{M}$.
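The two scheduling cases for $T_S$ can be written as a small Python helper. This is a sketch under stated assumptions: $m = \log_2 M$, the thresholds are the two conditions given above as reconstructed from the text, any configuration outside them is rejected rather than estimated, and the function name is illustrative.

```python
def s_combs_cycles(M, L, P):
    """Clock cycles T_S spent on the S-COMBS stages per symbol.

    M: symbol size in bits (a power of two), L: list size,
    P: number of processing units, each providing four adders.
    """
    m = M.bit_length() - 1          # m = log2(M)
    adders = 4 * P
    last_stage = 2**M * L           # messages of the final stage S_n
    if last_stage <= adders:
        return m                    # serial scheduling: one cycle per stage
    if 2**(M // 2) * L <= adders:
        # Overlapping scheduling: the last stage needs several cycles.
        return m - 1 + -(-last_stage // adders)   # integer ceiling
    raise ValueError("configuration outside the two cases considered")
```

For L = 4 and P = 64 this gives 1, 2 and 6 cycles for M = 2, 4 and 8, respectively.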

Table II shows the latencies (in clock cycles) for different
decoders to decode a (1024, 480) CRC32-concatenated polar
code with 64 processing units and L = 4. We assume a BS_L
needs one clock cycle to find the four maximum values out
of eight values. For M = 2 and M = 4, a folded sorting
implementation and the serial scheduling are used. For M = 8,
a pipelined tree sorting implementation and the overlapping
scheduling are applied. For M = 8 and q = 2, the basic unit
in the tree sorting implementation finds the two maximum
values out of eight values, which needs one clock cycle.
Therefore, $T_N = 4$ when M = 8 and q = 2.


TABLE II

Latencies for different decoders for a (1024, 480)
CRC32-concatenated polar code with 64 processing units and L = 4.

Decoder | $\gamma$ | $T_S$ | $T_N$ | q | Latency (# of cycles)
SDSCL-2 | 0.445    | 1     | 2     | 4 | 2069
SDSCL-4 | 0.395    | 2     | 4     | 4 | 1634
SDSCL-8 | 0.344    | 6     | 7     | 4 | 1540
SDSCL-8 | 0.344    | 6     | 4     | 2 | 1288

It is claimed in [22] that the M-bit SDSCL decoder could
have M times faster decoding speed than the bit-decision
SCL decoder, which is much better than our implementation
results. Let us review Eq. (12) again. For a fair comparison,
suppose the MPUs of the M-bit SDSCL decoder have the
same architecture as the conventional SCL decoder. Then a
conventional SCL decoder with the PCMS technique has a
latency of $T(1) = 2N + \frac{NL}{P}\log_2\frac{NL}{8P}$. The decoding speed
gain of the M-bit SDSCL decoder is

$$
\begin{aligned}
\frac{T(1)}{T(M)} &= \frac{2N + \frac{NL}{P}\log_2\frac{NL}{8P}}{(1-\gamma)\frac{N}{M}(T_S + T_N) + 2\frac{N}{M} + \frac{NL}{P}\log_2\frac{NL}{8P}} \\
&= M - \frac{(1-\gamma)N(T_S + T_N) + (M-1)\frac{NL}{P}\log_2\frac{NL}{8P}}{(1-\gamma)\frac{N}{M}(T_S + T_N) + 2\frac{N}{M} + \frac{NL}{P}\log_2\frac{NL}{8P}}.
\end{aligned}
\tag{13}
$$

To be exactly $M$ ($M > 1$) times faster,
$(1-\gamma)N(T_S + T_N) + (M-1)\frac{NL}{P}\log_2\frac{NL}{8P}$ should be zero. For $NL > 8P$,
$\frac{T(1)}{T(M)} < M$ because $T_S \geq 0$ and $T_N \geq 0$. For $NL = 8P$,
$T_S = T_N = 0$ should be satisfied, which means that the
calculation of the symbol-based channel transition probability
and the list pruning procedure do not take any clock cycle.
This is impractical because $T_S$ and $T_N$ cannot be zero in a
practical design. If $NL < 8P$ and $P < NL$, to achieve $M$
times faster,
$T_S + T_N = \frac{(M-1)\frac{NL}{P}\log_2\frac{8P}{NL}}{(1-\gamma)N} < \frac{5(M-1)}{(1-\gamma)N}$,
since $x \log_2\frac{8}{x} < 5$ for $1 < x < 8$. Usually,
$(1-\gamma)N \gg 5(M-1)$, so $T_S + T_N$ would have to be well
below one clock cycle. Therefore, the statement about the
decoding speed gain in [22] is too idealistic to be achieved
in practice because the practical implementation needs some
extra cycles to calculate the symbol-based channel transition
probability and to perform the list pruning function.
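Eqs. (12) and (13) can be evaluated for the configurations of Table II with the Python sketch below. The parameters $\gamma$, $T_S$ and $T_N$ are copied from Table II; rounding the result of Eq. (12) up to an integer number of cycles is an assumption made here, as are the function and variable names.

```python
from math import ceil, log2


def sdscl_latency(N, L, P, M, gamma, t_s, t_n):
    """Eq. (12): latency in clock cycles, before rounding."""
    b_trans = 2 * N / M + (N * L / P) * log2(N * L / (8 * P))
    return (1 - gamma) * (N / M) * (t_s + t_n) + b_trans


def scl_latency(N, L, P):
    """T(1): bit-decision SCL decoder with the PCMS technique."""
    return 2 * N + (N * L / P) * log2(N * L / (8 * P))


# (M, gamma, T_S, T_N, latency reported in Table II)
TABLE_II = [
    (2, 0.445, 1, 2, 2069),
    (4, 0.395, 2, 4, 1634),
    (8, 0.344, 6, 7, 1540),
    (8, 0.344, 6, 4, 1288),
]

if __name__ == "__main__":
    t1 = scl_latency(1024, 4, 64)
    for M, gamma, t_s, t_n, reported in TABLE_II:
        t_m = sdscl_latency(1024, 4, 64, M, gamma, t_s, t_n)
        print(M, ceil(t_m), reported, round(t1 / t_m, 2))
```

Under these assumptions the rounded-up values coincide with the latencies listed in Table II, and the ratio $T(1)/T(M)$ stays well below $M$ in every row, which is the point made by Eq. (13).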


C. Synthesis results 

To implement the proposed symbol-decision SCL decoder,
we consider only M = 2, 4 and 8. For $M \geq 16$, it is
impractical to build list pruning networks. For example, for
the worst case of M = 16, in which all the bits of a symbol are
information bits, there are $2^{16} L = 65536L$ paths. Even if
L = 1, finding the maximum value among 65536 values still
needs a huge amount of hardware resources and leads to a
huge latency.

In our implementations, L = 4. Each implementation has
64 processing units. LL messages are used in our designs.
The channel LL messages are quantized with 4 bits. A (1024,
480) CRC32-concatenated polar code is used. The synthesis
tool is Cadence RTL Compiler. The process technology is
TSMC 90nm CMOS technology. Our proposed architectures
are compared with the state-of-the-art SCL architectures in
[19], [20], [21], and [24], covering both bit- and
symbol-decision algorithms. The
synthesis results in [21] and [20] are also based on a TSMC
90nm CMOS technology. The original synthesis results of [19]
and [24] are based on UMC 90nm and ST 65nm CMOS
technologies, respectively.

The synthesis results shown in Table III demonstrate that
our symbol-decision SCL polar decoders have higher area
efficiencies than the SCL decoders in [19], [20], [24], and
[21]. The SCL decoders in [21] and [24] have higher clock
rates than our designs because they use registers as storage
units, whereas register files are used in our designs.
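The figures of merit in Table III are related by simple arithmetic, sketched below in Python. The throughput definition used here, N coded bits per decoding latency, is an assumption that happens to agree with the reported numbers; the function name is illustrative.

```python
def figures_of_merit(cycles, clock_mhz, area_mm2, N=1024):
    """Latency (us), throughput (Mbps) and area efficiency (Mb/s/mm^2).

    Assumption: throughput counts the N coded bits of one codeword
    delivered every decoding latency.
    """
    latency_us = cycles / clock_mhz
    throughput_mbps = N / latency_us
    return latency_us, throughput_mbps, throughput_mbps / area_mm2
```

For example, 2069 cycles at 500 MHz and an area of 1.126 mm$^2$ give about 4.14 us, 247 Mbps and 220 Mb/s/mm$^2$, close to the SDSCL-2 entry of Table III; small differences come from rounding intermediate values.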


The SDSCL-8 decoders provide a higher throughput and
a smaller latency than the SDSCL-2 and SDSCL-4 decoders,
and occupy larger areas. However, the improvements in
throughput and latency are not linear in the symbol size.

Compared with the SCL decoder in [20], the increase in
area of the symbol-decision SCL decoders is mainly due to
sorting networks because the adders of processing units are
reused to calculate both the bit- and symbol-based channel
transition probabilities. For the SDSCL-4 decoder, because the
sorting network is only 0.073 mm$^2$,
there is no need to shrink q further. For the SDSCL-8 decoders,
when q = 4, the area of the sorting network is 0.454 mm$^2$.
However, when q = 2, the sorting network occupies 0.196
mm$^2$, which is less than half of that for q = 4. A smaller q
does help the SDSCL-8 decoder achieve a higher throughput,
a smaller latency, a smaller area, and a higher area efficiency,
but it also introduces an FER performance loss of 0.25 dB to
the SDSCL-8 decoder at an FER level of $10^{-3}$ as shown in

Fig- 0 

Moreover, we also provide synthesis results for SDSCL-8
decoders without the PCMS technique. The PCMS technique
helps the SDSCL-8 decoders gain an area saving of about 0.12
mm$^2$.

As mentioned above, LL messages are used in our
designs. If LLR messages [21] are used, symbol-decision SCL
decoders can have better area efficiencies than our current
designs because the memory requirement for LLR messages
is lower than that for LL messages [21].


VI. Conclusion 

In this paper, we use the symbol-based recursive channel
combination to calculate the symbol-based channel transition
probability. We show that, based on the LL representation of
the transition probability, this recursive procedure needs fewer
additions than the method used in [22], [24]. Furthermore,
a two-stage list pruning network is proposed to simplify
the L-path finding problem. We use the PCMS technique to
reduce the memory requirement for list decoders. By applying
the PCMS technique, we design an efficient architecture for
symbol-decision SCL decoders. Specifically, we introduce two
scheduling schemes to perform the hardware sharing. A folded
sorting implementation and a tree sorting implementation are
also discussed. We also implement symbol-decision SCL polar
decoders with two-bit, four-bit and eight-bit symbols and
a list size of four. Our synthesis results show that
symbol-decision SCL polar decoders outperform existing SCL polar
decoders in terms of area efficiency. Our proposed methods
and architecture provide a range of tradeoffs between area,
throughput and area efficiency.

TABLE III

Synthesis results for different decoders with L = 4.

Design     | Algorithm           | M         | Message type | Clock rate (MHz) | Latency (us) | Throughput (Mbps) | Area (mm^2) | Area eff. (Mb/s/mm^2)
Proposed   | Symbol-decision SCL | 2         | LL           | 500              | 4.14         | 247               | 1.126       | 219.4
Proposed   | Symbol-decision SCL | 4         | LL           | 500              | 3.27         | 313               | 1.209       | 259.2
Proposed   | Symbol-decision SCL | 8 (q = 4) | LL           | 500              | 3.08         | 332               | 1.669       | 199.2
Proposed** | Symbol-decision SCL | 8 (q = 4) | LL           | 500              | 3.21         | 319               | 1.782       | 179.1
Proposed   | Symbol-decision SCL | 8 (q = 2) | LL           | 500              | 2.58         | 398               | 1.403       | 283.3
Proposed** | Symbol-decision SCL | 8 (q = 2) | LL           | 500              | 2.70         | 379               | 1.519       | 249.3
[24]       | Symbol-decision SCL | 2         | LL           | 525              | 3.89         | 262               | 1.98        | 132.3
[24]*      | Symbol-decision SCL | 2         | LL           | 379              | 5.39         | 189               | 3.79        | 49.9
[24]       | Symbol-decision SCL | 4         | LL           | 400              | 2.56         | 401               | 2.14        | 187.3
[24]*      | Symbol-decision SCL | 4         | LL           | 289              | 3.53         | 289               | 4.10        | 70.6
[20]       | Bit-decision SCL    | N/A       | LL           | 500              | 5.63         | 182               | 1.099       | 165.6
[19]†      | Bit-decision SCL    | N/A       | LL           | 694              | 4.06         | 252               | 2.197       | 114.7
[19]‡      | Bit-decision SCL    | N/A       | LL           | 314              | 8.25         | 124               | 3.53        | 35.1
[21]       | Bit-decision SCL    | N/A       | LLR          | 794              | 3.34         | 307               | 1.78        | 172

† The synthesis result in [19] is based on a UMC 90nm CMOS technology.

‡ The synthesis result is provided by the authors of [19] based on a TSMC 90nm CMOS technology.

* Original synthesis results in [24] are based on an ST 65nm CMOS technology. For a fair comparison, synthesis results scaled to a 90nm technology
are used in the comparison.

** The design is without the PCMS technique.


Appendix 


Proof of Proposition (7): According to the definition of
conditional probability, $\Pr(B|A) = \frac{\Pr(A, B)}{\Pr(A)}$.
Because all bits of u are independent and each bit has an equal
probability of being a 0 or 1,


PrKl+f-Vw) = PrKl+f- 1 ) = 2-^-P. 


Therefore, 


W^l\y\ K 




i$+l ) 

= 2 


2A u i<S>+<S>-l 




(15) 


According to Eq. 

1 / \ 

= ^K~\yt<l +,S> - 2 © © «*+*) 

(16) 


Similarly, we have 
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Then, by equations CD 
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